Explainer notebook¶

This notebook is an exam hand-in for the course 02467 Computational Social Science at the Technical University of Denmark (DTU) Spring 2022 Semester.

Group members:

  • Anne Sophie Høtbjerg Hansen s194274
  • Kirstine Cort Graae s194269
  • Frederikke Bornemann Christensen s194239

All members contributed completely equally.

We had different main responsibilities:

  • Anne Sophie : Network analysis & Explainer Notebook
  • Frederikke : Text analysis & Website
  • Kirstine : Network analysis & Website

! Notice that analysis of networks and text is presented at the website !

In [1]:
# Import all nessesary libraries
import json
import numpy as np
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
from collections import Counter
import matplotlib as mpl
import matplotlib.pyplot as plt
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from itertools import combinations
import networkx as nx
import netwulf as nw
import community as community_louvain
from wordcloud import WordCloud, STOPWORDS
from ast import literal_eval
import shifterator as sh
In [2]:
# Define colors for books
colors = ['red', 'green', 'magenta', 'blue', 'purple', 'cyan', 'orange']

Motivation¶

All three of us are hard-core Harry Potter fans. We therefore saw this as the perfect opportunity to put all of our Harry Potter fun facts to use. Our main dataset consists of Harry Potter book summaries at chapter-level. Furthermore, we have used a dataset that contain character information to make it easy to identify characters in the summaries. We chose our main dataset based on the fact that we wanted to look at the Harry Potter books, which are much more extensive compared to the movies, but were not able to use the books directly due to copyrights. Fandom (previously known as Wikia) is a service that hosts wikis mainly on entertainment (i.e. books, movies, and tv shows). Thus, it's basically the wikipedia for entertainment and top-quality. This is the reason why we webcapped summaries from there.

Our mission is to do a deep dive into the realm of Harry Potter and examine the development in the seven books. Firstly, use social networks to look at characters. We will have a look at which characters play the most essential roles, which are connected, and what do they have in common if we partition them? Secondly, use text analysis tools to look at topics and themes. We will have a look at what the most important topics are. Furthermore, we have a theory that the books are become more and more dark, sinister, and gloomy, which we can hopefully detect via sentiment analysis. We hope that this will be fun for people who are as passionate about Harry Potter as us.

Load and clean data¶

Get summaries¶

Process and methods:

  1. Define all https to the book summaries.
  2. Use requests and Beautiful Soup to webscrape the book summaries and collect the information in a dataframe with book number, chapter number, chapter title and chapter summary.
In [3]:
# All links starts the same
root = 'https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_'
# End of link for the specific books
endpaths = ['Philosopher%27s_Stone', 'Chamber_of_Secrets', 'Prisoner_of_Azkaban', 'Goblet_of_Fire','Order_of_the_Phoenix', 'Half-Blood_Prince', 'Deathly_Hallows']
# Use list comprehention to create a list of urls
urls = [root+endpath for endpath in endpaths]
In [4]:
def GetData(urls):
    """
    Webscrabe and collect all relevant information in a dataframe
    
    """
    
    #Create empty lists to collect book number, chapter number, chapter title and chapter summary
    books = []
    chapter_numbers = []
    chapter_titles = []
    summaries = []
    
    # Scrape for each book
    for book, url in enumerate(urls):
        
        # Create soup
        req = requests.get(url)
        html = req.content
        soup = BeautifulSoup(html, 'html.parser')
        text = soup.find_all(text=True)
        output = ''
        for t in text:
            output += f'{t} '
        
        # Find headlines 
        headers = soup.find_all(['h3'])
        for head in headers:
            if head.span is not None: 
                head.span.unwrap()
        
        # Convert chapter titles to str
        chapter_headers = [head.text for head in headers if head.text[:7]=='Chapter'] 
        # Update chapter numbers
        chapter_numbers.extend([i+1 for i in range(len(chapter_headers))])
        # Update chapter names
        chapter_titles.extend([title[11:] for title in chapter_headers]) 
        # Update book number
        books.extend([book+1 for i in range(len(chapter_headers))])

        # Retrive the relevent text
        # When book summaries end 
        # they are always followed by either "List of spells first introduced" or "List of deaths"
        HeadlineNames = chapter_headers + ['List of spells first introduced', 'List of deaths'] 
        
        # Unwrap text and headlines
        Texts = soup.find_all(['p', 'h3', 'h2'])
        for section in Texts: 
            if section is not None:
                if section.a is not None: 
                    section.a.unwrap()
                elif section.span is not None: 
                    section.span.unwrap()

        # Store text in a list
        Texts = [t.text for t in Texts]

        # Find section indices in Texts
        CaptionIndices = []
        for caption in HeadlineNames:
            if caption in Texts:
                CaptionIndices.append(Texts.index(caption))
        CaptionIndices = CaptionIndices[:len(chapter_headers)+1]
        # Store texts according to section
        SectionText = []
        for i in range(len(CaptionIndices)-1):
            SectionText.append(Texts[CaptionIndices[i]+1:CaptionIndices[i+1]])
        summaries.extend(SectionText)

    # Fill dataframe
    df = pd.DataFrame({"book":books, "chapter_number": chapter_numbers,"chapter_title": chapter_titles,"summary": summaries})
    
    return df
In [5]:
# Dataframe with summary information
df = GetData(urls)
df
Out[5]:
book chapter_number chapter_title summary
0 1 1 The Boy Who Lived [Vernon and Petunia Dursley, of Number Four P...
1 1 2 The Vanishing Glass [Dudley counting his presents, Ten years pass ...
2 1 3 The Letters from No One [Hundreds of letters arriving at the fireplace...
3 1 4 The Keeper of the Keys [Rubeus Hagrid enters the cabin, There is anot...
4 1 5 Diagon Alley [Ollivander's Wand Shop, When Harry wakes the ...
... ... ... ... ...
193 7 32 The Elder Wand [Voldemort and the Elder Wand, Harry, Hermione...
194 7 33 The Prince's Tale [Snape's memories, Harry dives into Snape's me...
195 7 34 The Forest Again [Harry's mother comforting him before his "dea...
196 7 35 King's Cross [Harry during his "death", However, Harry find...
197 7 36 The Flaw in the Plan [Hagrid carrying Harry's body, Back in the for...

198 rows × 4 columns

Get characters¶

Process and methods:

  1. Load file found on the internet about character information.
  2. We need to be able to identify characters on their first- and last name alone as they rarely are mentioned by their full name (often only once when they are introduced). Therefore, two columns are added to the dataframe with first- and last name separately. The characters that only go by one name will have NKLN (No Known Last Name) in the last name column.
  3. Fill the alternate names column (nick names) with NKAN if character has No Known Alternate Name.
  4. Fill the house column (house affiliation) with NKH if character has No Known House.
In [6]:
# Load file with characters and their attributes as dataframe
characters = pd.read_json('characters.json')
# Remove pictures of the characters from the dataframe
characters.pop('image')
# Dataframe with character information
characters
Out[6]:
name alternate_names species gender house dateOfBirth yearOfBirth wizard ancestry eyeColour hairColour wand patronus hogwartsStudent hogwartsStaff actor alternate_actors alive
0 Harry Potter [] human male Gryffindor 31-07-1980 1980 True half-blood green black {'wood': 'holly', 'core': 'phoenix feather', '... stag True False Daniel Radcliffe [] True
1 Hermione Granger [] human female Gryffindor 19-09-1979 1979 True muggleborn brown brown {'wood': 'vine', 'core': 'dragon heartstring',... otter True False Emma Watson [] True
2 Ron Weasley [Dragomir Despard] human male Gryffindor 01-03-1980 1980 True pure-blood blue red {'wood': 'willow', 'core': 'unicorn tail-hair'... Jack Russell terrier True False Rupert Grint [] True
3 Draco Malfoy [] human male Slytherin 05-06-1980 1980 True pure-blood grey blonde {'wood': 'hawthorn', 'core': 'unicorn tail-hai... True False Tom Felton [] True
4 Minerva McGonagall [] human female Gryffindor 04-10-1925 1925 True black {'wood': '', 'core': '', 'length': ''} tabby cat False True Dame Maggie Smith [] True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
398 Albus Severus Potter [Al] human male Slytherin True half-blood green black {'wood': '', 'core': '', 'length': ''} True False Arthur Bowen [] True
399 Rose Weasley [] human female Gryffindor True half-blood red {'wood': '', 'core': '', 'length': ''} True False Helena Barlow [] True
400 Hugo Weasley [] human male True half-blood brown {'wood': '', 'core': '', 'length': ''} True False Ryan Turner [] True
401 Scorpius Malfoy [Scorpius Hyperion Malfoy] human male Slytherin True pure-blood grey blond {'wood': '', 'core': '', 'length': ''} True False Bertie Gilbert [] True
402 Victoire Weasley [] human female True blonde {'wood': '', 'core': '', 'length': ''} True False [] True

403 rows × 18 columns

In [7]:
def FirstLastNames(names):
    """
    Function to find first and last name (if possible) for all characters
    
    """
    
    # Make empty lists to contain the names
    first_names = []
    last_names = []
    
    # Go through all names
    for name in names:
        # If there is a space in the name, we have both a first and last name
        if ' ' in name:
            split = name.split(' ')
            # If the character has more than one last name
            if len(split) >= 3:
                # Add first name 
                first_names.append(split[0])
                last_name = ''
                # Add all last names to the last name list
                for part_name in split[1:]:
                    last_name = (last_name+' '+part_name)
                last_names.append(last_name)
            # If the character only has 1 first and last name
            else:
                first_names.append(split[0])
                last_names.append(split[1])
        # If no last name is known add 'No Known Last Name' =NKLN
        else:
            first_names.append(name)
            last_names.append('NKLN')
            
    return first_names, last_names

def NickNames(alternate_name):
    """
    Makes it possible to fill column with NKAN if character has No Known Alternate Name
    
    """
    names = []
    for name in alternate_name:
        # If no alternate name is known insert NKAN (No Known Alternate Name)
        if name == []:
            names.append('NKAN')
        # Add the alternate name as a str instead of list
        else:
            names.append(name[0])
    return names

def HogwartsHouse(houses):
    """
    Makes it possible fill column with NKH if character has No Known House
    
    """
    # If house is not assigned replace empty str with NKH (No Known House)
    fill_houses = ["NKH" if x == '' else x for x in houses]
    return fill_houses

def SpellNameCorrectly(characters, incorrect_name, correct_name):
    """
    Correct name of character if spelled incorrectly
    
    """
    characters.replace(incorrect_name, correct_name)
In [8]:
def CleanCharactersDataframe(characters):
    """
    Apply functions FirstLastNames, NickNames, HogwartsHouse and SpellNameCorrectly to dataframe
    
    """
    # We noticed that Quirinus Quirrell was spelled differently in the two files
    SpellNameCorrectly(characters,'Quirinus Quirrel','Quirinus Quirrell' )
    # Add columns with first- and last name to dataframe
    characters['first_names'], characters['last_names'] = FirstLastNames(list(characters['name']))
    characters.insert(1, 'first_names', characters.pop('first_names'))
    characters.insert(2, 'last_names', characters.pop('last_names'))
    # Add column with their alternate names
    characters['alternate_names'] = NickNames(list(characters['alternate_names']))
    # Add column with their house
    characters['house'] = HogwartsHouse(list(characters['house']))
    return characters
In [9]:
characters = CleanCharactersDataframe(characters)
In [10]:
# Dataframe with cleaned character information
characters
Out[10]:
name first_names last_names alternate_names species gender house dateOfBirth yearOfBirth wizard ancestry eyeColour hairColour wand patronus hogwartsStudent hogwartsStaff actor alternate_actors alive
0 Harry Potter Harry Potter NKAN human male Gryffindor 31-07-1980 1980 True half-blood green black {'wood': 'holly', 'core': 'phoenix feather', '... stag True False Daniel Radcliffe [] True
1 Hermione Granger Hermione Granger NKAN human female Gryffindor 19-09-1979 1979 True muggleborn brown brown {'wood': 'vine', 'core': 'dragon heartstring',... otter True False Emma Watson [] True
2 Ron Weasley Ron Weasley Dragomir Despard human male Gryffindor 01-03-1980 1980 True pure-blood blue red {'wood': 'willow', 'core': 'unicorn tail-hair'... Jack Russell terrier True False Rupert Grint [] True
3 Draco Malfoy Draco Malfoy NKAN human male Slytherin 05-06-1980 1980 True pure-blood grey blonde {'wood': 'hawthorn', 'core': 'unicorn tail-hai... True False Tom Felton [] True
4 Minerva McGonagall Minerva McGonagall NKAN human female Gryffindor 04-10-1925 1925 True black {'wood': '', 'core': '', 'length': ''} tabby cat False True Dame Maggie Smith [] True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
398 Albus Severus Potter Albus Severus Potter Al human male Slytherin True half-blood green black {'wood': '', 'core': '', 'length': ''} True False Arthur Bowen [] True
399 Rose Weasley Rose Weasley NKAN human female Gryffindor True half-blood red {'wood': '', 'core': '', 'length': ''} True False Helena Barlow [] True
400 Hugo Weasley Hugo Weasley NKAN human male NKH True half-blood brown {'wood': '', 'core': '', 'length': ''} True False Ryan Turner [] True
401 Scorpius Malfoy Scorpius Malfoy Scorpius Hyperion Malfoy human male Slytherin True pure-blood grey blond {'wood': '', 'core': '', 'length': ''} True False Bertie Gilbert [] True
402 Victoire Weasley Victoire Weasley NKAN human female NKH True blonde {'wood': '', 'core': '', 'length': ''} True False [] True

403 rows × 20 columns

Clean¶

Process and methods:

  1. Clean the summaries where we use \W (regular expression) to split the text on anything other than a word character.
  2. Fill summary column with the cleaned summaries to our main dataframe.
In [11]:
def CleanSummaries(df):
    """
    Cleans the summaries
    
    """
    # Create for cleaned summaries 
    summary = []
    for i in range(len(df)):
        text = df['summary'].iloc[i]
        # Remove newline
        text = [t.strip() for t in text]
        # Split in non-sign tokens
        tokens = []
        for t in text: 
            tokens = tokens + re.split(r'\W', t)
        # Join tokens 
        text = ' '.join(tokens)
        text = text.replace(' s ', ' ')
        text = text.replace(' t ', ' ')
        text = text.replace('  ', ' ')
        # Append cleaned text to summary
        summary.append(text)
    return summary
In [12]:
%%capture
#Only run once - otherwise it will be a little too tokenized if u catch my drift ;)
df['summary'] = CleanSummaries(df)
df['plot_tokens'] = [row['summary'].split(" ") for _,row in df.iterrows()]

# save df to csv file
df.to_csv('plot_summary_df.csv')
In [13]:
# Dataframe with cleaned and tokenized plot summaries
df
Out[13]:
book chapter_number chapter_title summary plot_tokens
0 1 1 The Boy Who Lived Vernon and Petunia Dursley of Number Four Priv... [Vernon, and, Petunia, Dursley, of, Number, Fo...
1 1 2 The Vanishing Glass Dudley counting his presents Ten years pass si... [Dudley, counting, his, presents, Ten, years, ...
2 1 3 The Letters from No One Hundreds of letters arriving at the fireplace ... [Hundreds, of, letters, arriving, at, the, fir...
3 1 4 The Keeper of the Keys Rubeus Hagrid enters the cabin There is anothe... [Rubeus, Hagrid, enters, the, cabin, There, is...
4 1 5 Diagon Alley Ollivander Wand Shop When Harry wakes the next... [Ollivander, Wand, Shop, When, Harry, wakes, t...
... ... ... ... ... ...
193 7 32 The Elder Wand Voldemort and the Elder Wand Harry Hermione an... [Voldemort, and, the, Elder, Wand, Harry, Herm...
194 7 33 The Prince's Tale Snape memories Harry dives into Snape memories... [Snape, memories, Harry, dives, into, Snape, m...
195 7 34 The Forest Again Harry mother comforting him before his death H... [Harry, mother, comforting, him, before, his, ...
196 7 35 King's Cross Harry during his death However Harry finds him... [Harry, during, his, death, However, Harry, fi...
197 7 36 The Flaw in the Plan Hagrid carrying Harry body Back in the forest ... [Hagrid, carrying, Harry, body, Back, in, the,...

198 rows × 5 columns

Tools, theory and analysis¶

Notice that most analysis can be found on the website and not in this Notebook!

Networks¶

Process and methods:

  1. Find connections between characters - they are connected if mentioned in the same chapter. We have defined prefixes that identify characters with common prefixes such as Mrs. Norris. We have defined location contain names that must not me confused with a character. We have also defined he houses that share names with the last name of characters. Furthermore, we take into account that Tom Riddle is not an idividual character and should be associated with Lord Voldemort.
  2. Exclude characters not mentioned from our character-information dataframe.
  3. Find every time a character is mentioned in a chapter. This is nessesary as it will make us able to compute node weight. Otherwise, our GetNodeWeights-function would return the number of chapters a character is mentioned in and not the total amount of times.
  4. Edges: Compute which characters are connected two-and-two.
  5. Edge weights: Compute number of times two characters co-appear in a summary.
  6. Node weights: Compute number of times the characters are mentioned.
In [14]:
def FindConnectedCharacters(df, name, first_names, last_names, alternate_names): 
    """
    Detect connected characters
    
    """
    # Prefixes that identify characters with common prefixes
    exeptions = ['Mrs','Mr','Fat','Nearly','Sir','The','Bloody','Moaning','Dr','Madam','Wizard']
    # Some locations contain names that must not be confused with the character
    places = ['Godric Hollow','Slytherin Chamber','Myrtle Bathroom','Malfoy Manor','Weasly Burrow','Hagrid Hut']
    # The houses also share names with the last name of character    
    houses = ['Gryffindor','Hufflepuff','Ravenclaw','Slytherin']

    # Get connections
    Connections = {}
    for _,row in df.iterrows(): #all texts 
        TextEdges = []
        mentioned = []
        for i in range(len(first_names)):
            all_names = [name[i],first_names[i],last_names[i],alternate_names[i]]
            if first_names[i] in exeptions:
                if name[i] in row['summary']:
                    TextEdges.append(name[i])
                else:
                    continue
            elif last_names[i] in houses:
                if first_names[i] in row['summary']:
                    TextEdges.append(name[i])
                else:
                    continue
            else:        
                for split_name in all_names:
                    if split_name in row['summary'].split(' '):
                        if split_name in mentioned:
                            break
                        # Tom Riddle is not an individual character and should be associated with Voldemort.
                        if name[i] == 'Tom Riddle':
                            TextEdges.append('Lord Voldemort')
                            mentioned.append('Lord Voldemort')
                            mentioned.append('Lord')
                            mentioned.append('Voldemort')
                        else:
                            TextEdges.append(name[i])
                            mentioned.append(name[i])
                            mentioned.append(first_names[i])
                            mentioned.append(last_names[i])
                        break
                    else:
                        continue
        
        # Store connection 
        Connections[(row['book'], row['chapter_number'], row['chapter_title'])] = set(TextEdges)            
    
    return Connections
In [15]:
Connections = FindConnectedCharacters(df,characters['name'],characters['first_names'],characters['last_names'],characters['alternate_names'])
In [16]:
# Find unique characters mentioned
unique_characters = list(Connections.values())
unique_characters = [item for sublist in unique_characters for item in sublist]
# Make dataframe where only the mentioned unique characters are in (to minimize mistakes)
characters_p = characters[characters['name'].isin(unique_characters)]
characters_p = characters_p.reset_index(drop=True)
In [17]:
def ConnectionsCounted(df, name, first_name, last_name, alternate_name): 
    
    """
    Count number of times unique characters are mentioned
    
    """
    
    # These prefixes are irrelevant
    exeptions = ['Mrs','Mr','Fat','Nearly','Sir','The','Bloody','Moaning','Dr','Madam','Wizard']
    # Some locations contain names are must not me confused with the character
    places = ['Godric Hollow','Slytherin Chamber','Myrtle Bathroom','Malfoy Manor','Weasly Burrow','Hagrid Hut']
    # The houses also share names with the last name of character    
    houses = ['Gryffindor','Hufflepuff','Ravenclaw','Slytherin']

    # Get connections
    Connections = {}
    
    for _,row in df.iterrows():
        
        TextEdges = []
        mentioned = []
        tokens = row['summary'].split(' ')
        
        for j in range(len(first_name)):
            if first_name[j] in exeptions:
                if name[j] in row['summary']:
                    TextEdges.append([name[j]]*row['summary'].count(name[j]))
                    continue
                else:
                    continue
            elif last_name[j] in houses:
                if first_name[j] in row['summary']:
                    TextEdges.append([name[j]]*row['summary'].count(name[j]))
                    continue
                else:
                    continue
                    
            for i in range(len(tokens)):
                if tokens[i] == first_name[j]:
                    TextEdges.append([name[j]])
                    mentioned.append(first_name[j])
                    mentioned.append(last_name[j])
                    mentioned.append(name[j])
                    
                if tokens[i] == last_name[j]:
                    if first_name[j] in tokens[i-4:i]:
                        continue
                    else:
                        if name[j] in mentioned:
                            TextEdges.append([name[j]])
                        elif first_name[j] not in mentioned:
                            if last_name[j] in mentioned:
                                continue
                            else:
                                TextEdges.append([name[j]])
                                mentioned.append(first_name[j])
                                mentioned.append(last_name[j])
                                mentioned.append(name[j])
                                continue
                else:
                    continue
        
        # Store connection 
        Connections[(row['book'],row['chapter_number'], row['chapter_title'])] = TextEdges            
    
    return Connections
In [18]:
Connections_count = ConnectionsCounted(df,characters_p['name'],characters_p['first_names'],characters_p['last_names'],characters_p['alternate_names'])
In [19]:
def UnNest(dictionary):
    """
    Unnest dictionary
    
    """
    keys = list(dictionary.keys())
    for key in keys:
        value = dictionary[key] 
        dictionary[key] = [item for sublist in value for item in sublist]
    return dictionary
In [20]:
Connections_count = UnNest(Connections_count)
In [21]:
def GetEdges(Connections): 
    
    """
    Compute which characters are connected two-and-two
    
    """
    
    # Get all edges
    Edges = []
    for g in Connections.values(): 
        Edges += [i for i in combinations(g, 2)] #For each fully connected subgraph, add all links in that graph 

    # Remove double. tripple etc. edges 
    Edges = [tuple(sorted(edge)) for edge in Edges]
    Edges = set(Edges)
    
    return Edges
In [22]:
Edges = GetEdges(Connections)
In [23]:
def GetLinkWeights(Edges, Connections):
    """
    Compute number of times two characters co-appear in a summary
    
    """
    
    Weights = {edge: 0 for edge in Edges}
    for edge in Weights.keys(): 
        w = 0
        for s in Connections.values(): 
            if (edge[0] in s) and (edge[1] in s): 
                w +=1
        Weights[edge] = w

    WeightsInput = [(key[0],key[1], {'weight': val}) for key, val in Weights.items()] 
    
    return WeightsInput
In [24]:
def GetNodeWeights(Connections):
    """
    Compute number of times the characters are mentioned
    
    """
    
    unique_characters = list(Connections.values())
    unique_characters = [item for sublist in unique_characters for item in sublist]
    count = Counter(unique_characters)
    count = dict(sorted(count.items(), key=lambda item: item[1],reverse = True))
    return count
In [25]:
NodeWeights = GetNodeWeights(Connections)

Split into individual books¶

Split the data into individual books as well and not only the entire book series.

In [26]:
# Unweighted
for i in range(1,8):
    keys = [k for k in Connections if k[0] ==i]
    globals()[f"UWC{i}"] =  {your_key: Connections[your_key] for your_key in keys}
In [27]:
# Weighted
for i in range(1,8):
    keys = [k for k in Connections_count if k[0] ==i]
    globals()[f"WC{i}"] =  {your_key: Connections_count[your_key] for your_key in keys}
In [28]:
# Edges for the books individually and the book series

Edges = GetEdges(Connections)

print("# Edges in all books is", len(Edges), ". Thus, there are", len(Edges), "character relations mentioned in the summaries.")

# Get number of edges for each book
E1 = len(GetEdges(UWC1))
E2 = len(GetEdges(UWC2))
E3 = len(GetEdges(UWC3))
E4 = len(GetEdges(UWC4))
E5 = len(GetEdges(UWC5))
E6 = len(GetEdges(UWC6))
E7 = len(GetEdges(UWC7))

# Create a dataset
h = [E1, E2, E3, E4, E5, E6, E7]
b = ('Book1', 'Book2', 'Book3', 'Book4', 'Book5', 'Book6', 'Book7')
x_pos = np.arange(len(b))

# Create bars with different colors
plt.bar(x_pos, h, color=colors)

# Create names on the x-axis
plt.xticks(x_pos, b)

# Create title name
plt.title("Edges for each book individually")

# Show graph
plt.show()
# Edges in all books is 3283 . Thus, there are 3283 character relations mentioned in the summaries.
  • Books 1, 2, 3, and 6 mainly revolve around life at school with no new additions to the school life really. Therefore, not many new relationships are introduced across the board.
  • Book 4 adds life to the school due to the Triwizard Tournament. Additionally, the characters have also become a little more adult in this book and start to date.
  • Book 5 introduces the Order of the Phoenix and its members which are connected to not only each other but also makes it common for lots of people to appear in chapters with lots of other people instead of only Harry, Ron and Hermonie.
  • Book 7 is the showdown between Good and Evil which contain a massive battle where characters from every side of the spectrum (that wouldn't necessarily have met otherwise) battle or help each other.
In [29]:
# Nodes for the books individually and the book series

Nodes = GetNodeWeights(Connections)

print("# Nodes in all books is", len(Nodes), ". Thus, there are", len(Nodes), "unique characters in the summaries.")

# Get number of nodes for each book
N1 = len(GetNodeWeights(UWC1))
N2 = len(GetNodeWeights(UWC2))
N3 = len(GetNodeWeights(UWC3))
N4 = len(GetNodeWeights(UWC4))
N5 = len(GetNodeWeights(UWC5))
N6 = len(GetNodeWeights(UWC6))
N7 = len(GetNodeWeights(UWC7))

# Create a dataset
h = [N1, N2, N3, N4, N5, N6, N7]
b = ('Book1', 'Book2', 'Book3', 'Book4', 'Book5', 'Book6', 'Book7')
x_pos = np.arange(len(b))

# Create bars with different colors
plt.bar(x_pos, h, color=colors)

# Create names on the x-axis
plt.xticks(x_pos, b)

# Create title name
plt.title("Nodes for each book individually")

# Show graph
plt.show()
# Nodes in all books is 183 . Thus, there are 183 unique characters in the summaries.
  • Books 1, 2, and 3 mainly revolve around life at school. They each have characters that only appear in that specific book. For instance, it's very known that the Defence Against the Dark Arts teacher is replaced quite frequently (yearly). Other than that they mainly consist of core characters that appear in every book.
  • Book 4 also revolves around the school but characters from other schools are introduced due to the Triwizard Tournament.
  • Book 5 introduces the Order of the Phoenix and its members. Furthermore, Death Eaters, who previously only were known as a collective of wizards that support Lord Voldemort, are now mentioned by name. Additionally, the Ministry of Magic plays a more important role as some of the book also takes place there for the first time.
  • Book 6 maintains from Book 5 that Death Eaters are mentioned more a individuals.
  • Book 7 is to no surprise the book with most unique characters. It basically revolves around the battle between Good and Evil, and both sides have lots of characters.

Graphs¶

The aim of the network analysis was to locate communities in the dataset. Furthermore, the analysis should make it possible to identify the main characters of every book. The Louvain method for community detection is used to partion characters into communities. The Louvain method computes the partition of all nodes in respect to the partition which maximises the modularity by the use of Louvain heuristics. Unweighted graphs for the book series as a whole and the individual books in combination with Louvain communities will be visualized and analysed. Based on these visualizations, it will be discussed whether the communities make sense in relation to the plots of every book. For us, it will be fun to see if the Louvain community detection method supports a pattern that we, as fans, can see makes sense. Furthermore, modularity will be used to measure the the density of connections within a community. High modularity indicates dense connections between the nodes within communities but sparse connections between nodes in different modules. Modularity is in the interval [-1.0,1.0].

Weighted graphs for the book series as a whole and the individual books will also be visualized and analysed. They make it possible to determine the main characters in the book(s). This requires that the node weight of each character is computed. Notice that this node weight is calculated by counting how many times each character is mentioned in total. From this method, one should be able to detect the most important characters of the book(s) based on the specific mention count. To the suprise over everyone, we firmly believe that Harry Potter comes out as the winner at that point ;)

In [30]:
def UnweightedGraph(Connections):
    """
    Uses NetworkX to build an unweighted graph. 
    
    """
    
    # Compute links: 
    Edges = GetEdges(Connections)

    # Initialize
    G = nx.Graph()

    # Add nodes
    G.add_nodes_from(set().union(*Connections.values()), LouvainPartition = None, group = None)

    # Add edges to graph 
    G.add_edges_from(Edges)  

    return G
In [31]:
def GraphModularity(graph, p):
    """
    Computes modularity 
    
    """
    
    # Number of links in graph:
    L = len(graph.edges())

    M = 0
    key = p[0] 
    p = p[1]
    deg = {i:0 for i in p}
    links = {i:0 for i in p}

    # Loop through graph nodes
    for node in graph: 
        par = graph.nodes[node][key] # Node partition 
        deg[par] += graph.degree[node] # Node degree

        l = sum([w.get('weight', 1)/2 for n,w in graph[node].items() if graph.nodes[node][key] == graph.nodes[n][key]])

        links[par] += l

    for par in p: 
        M += links[par]/L - (deg[par]/(2*L))**2

    return M
In [32]:
def WeightedGraph(data, Connections, Nodes): 
    """
    Uses NetworkX to build a Weighted Graph. 
    
    """
 
    # Compute node weights
    node_weight = GetNodeWeights(Connections)
    
    # Find links: 
    Edges = GetEdges(Connections)
    
    # Compute link weight
    link_weight = GetLinkWeights(Edges, Connections)
    
    # Initialise weighted graph
    G_W = nx.Graph()

    # Add nodes
    G_W.add_nodes_from(set().union(*Connections.values()), LouvainPartition = None, group = None, size = None)

    # Add node size (weight)
    for key, val in node_weight.items(): 
        G_W.nodes[key]['size'] = val

    # Add links and weights
    G_W.add_edges_from(link_weight)
    
    return G_W

Entire book¶

In [33]:
# Undirected graph
G_UW = UnweightedGraph(Connections)

# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G_UW)

# Add this to graph
nx.set_node_attributes(G_UW, LouvainCommunities, 'group')

# Interactive graph
#UWGraphVisu, _ = nw.visualize(G_UW)

# Give communities an id
community_id = [LouvainCommunities[node] for node in G_UW.nodes()]

# Visualise graph 
fig = plt.figure(figsize=(25,20))
nx.draw(G_UW,
       edge_color = 'lightgrey',
       cmap = plt.cm.PuRd,
       node_color = community_id,
       node_size = 500,
       with_labels = True,
       edgecolors = 'black')

# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M = GraphModularity(G_UW, partition)
print("Modularity of unweighted graph is ", str(M))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-33-698d91d80468> in <module>
      3 
      4 # Compute the best partition
----> 5 LouvainCommunities = community_louvain.best_partition(G_UW)
      6 
      7 # Add this to graph

AttributeError: module 'community' has no attribute 'best_partition'
In [ ]:
G_W = WeightedGraph(characters_p, Connections_count, Nodes)

WGraphVisu, _ = nw.visualize(G_W)

WG.png

Book 1¶

In [ ]:
# Undirected graph
G1_UW = UnweightedGraph(UWC1)

# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G1_UW)

# Add this to graph
nx.set_node_attributes(G1_UW, LouvainCommunities, 'group')

# Interactive graph 
#UWGraphVisu1, _ = nw.visualize(G1_UW)

# Give communities an id
community_id = [LouvainCommunities[node] for node in G1_UW.nodes()]

# Visualise graph 
fig = plt.figure(figsize=(25,20))
nx.draw(G1_UW,
       edge_color = 'lightgrey',
       cmap = plt.cm.PuRd,
       node_color = community_id,
       node_size = 500,
       with_labels = True,
       edgecolors = 'black')

# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M1 = GraphModularity(G1_UW, partition)
print("Modularity of unweighted graph is ", str(M1))
In [ ]:
G1_W = WeightedGraph(characters_p, WC1, Nodes)

#WGraphVisu1, _ = nw.visualize(G1_W)

WG1.png

Book 2¶

In [ ]:
# Undirected graph
G2_UW = UnweightedGraph(UWC2)

# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G2_UW)

# Add this to graph
nx.set_node_attributes(G2_UW, LouvainCommunities, 'group')

# Interactive graph 
#UWGraphVisu2, _ = nw.visualize(G2_UW)

# Give communities an id
community_id = [LouvainCommunities[node] for node in G2_UW.nodes()]

# Visualise graph 
fig = plt.figure(figsize=(25,20))
nx.draw(G2_UW,
       edge_color = 'lightgrey',
       cmap = plt.cm.PuRd,
       node_color = community_id,
       node_size = 500,
       with_labels = True,
       edgecolors = 'black')

# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M2 = GraphModularity(G2_UW, partition)
print("Modularity of unweighted graph is ", str(M2))
In [ ]:
G2_W = WeightedGraph(characters_p, WC2, Nodes)

#WGraphVisu2, _ = nw.visualize(G2_W)

WG2.png

Book 3¶

In [ ]:
# Undirected graph
G3_UW = UnweightedGraph(UWC3)

# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G3_UW)

# Add this to graph
nx.set_node_attributes(G3_UW, LouvainCommunities, 'group')

# Interactive graph 
#UWGraphVisu3, _ = nw.visualize(G3_UW)

# Give communities an id
community_id = [LouvainCommunities[node] for node in G3_UW.nodes()]

# Visualise graph 
fig = plt.figure(figsize=(25,20))
nx.draw(G3_UW,
       edge_color = 'lightgrey',
       cmap = plt.cm.PuRd,
       node_color = community_id,
       node_size = 500,
       with_labels = True,
       edgecolors = 'black')

# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M3 = GraphModularity(G3_UW, partition)
print("Modularity of unweighted graph is ", str(M3))
In [ ]:
G3_W = WeightedGraph(characters_p, WC3, Nodes)

#WGraphVisu3, _ = nw.visualize(G3_W)

WG3.png

Book 4¶

In [ ]:
# Undirected graph
G4_UW = UnweightedGraph(UWC4)

# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G4_UW)

# Add this to graph
nx.set_node_attributes(G4_UW, LouvainCommunities, 'group')

# Interactive graph 
#UWGraphVisu4, _ = nw.visualize(G4_UW)

# Give communities an id
community_id = [LouvainCommunities[node] for node in G4_UW.nodes()]

# Visualise graph 
fig = plt.figure(figsize=(25,20))
nx.draw(G4_UW,
       edge_color = 'lightgrey',
       cmap = plt.cm.PuRd,
       node_color = community_id,
       node_size = 500,
       with_labels = True,
       edgecolors = 'black')

# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M4 = GraphModularity(G4_UW, partition)
print("Modularity of unweighted graph is ", str(M4))
In [ ]:
G4_W = WeightedGraph(characters_p, WC4, Nodes)

#WGraphVisu4, _ = nw.visualize(G4_W)

WG4.png

Book 5¶

In [ ]:
# Undirected graph
G5_UW = UnweightedGraph(UWC5)

# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G5_UW)

# Add this to graph
nx.set_node_attributes(G5_UW, LouvainCommunities, 'group')

# Interactive graph 
#UWGraphVisu5, _ = nw.visualize(G5_UW)

# Give communities an id
community_id = [LouvainCommunities[node] for node in G5_UW.nodes()]

# Visualise graph 
fig = plt.figure(figsize=(25,20))
nx.draw(G5_UW,
       edge_color = 'lightgrey',
       cmap = plt.cm.PuRd,
       node_color = community_id,
       node_size = 500,
       with_labels = True,
       edgecolors = 'black')

# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M5 = GraphModularity(G5_UW, partition)
print("Modularity of unweighted graph is ", str(M5))
In [ ]:
G5_W = WeightedGraph(characters_p, WC5, Nodes)

#WGraphVisu5, _ = nw.visualize(G5_W)

WG5.png

Book 6¶

In [ ]:
# Undirected graph
G6_UW = UnweightedGraph(UWC6)

# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G6_UW)

# Add this to graph
nx.set_node_attributes(G6_UW, LouvainCommunities, 'group')

# Interactive graph 
#UWGraphVisu6, _ = nw.visualize(G6_UW)

# Give communities an id
community_id = [LouvainCommunities[node] for node in G6_UW.nodes()]

# Visualise graph 
fig = plt.figure(figsize=(25,20))
nx.draw(G6_UW,
       edge_color = 'lightgrey',
       cmap = plt.cm.PuRd,
       node_color = community_id,
       node_size = 500,
       with_labels = True,
       edgecolors = 'black')

# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M6 = GraphModularity(G6_UW, partition)
print("Modularity of unweighted graph is ", str(M6))
In [ ]:
G6_W = WeightedGraph(characters_p, WC6, Nodes)

#WGraphVisu6, _ = nw.visualize(G6_W)

WG6.png

Book 7¶

In [ ]:
# Undirected graph
G7_UW = UnweightedGraph(UWC7)

# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G7_UW)

# Add this to graph
nx.set_node_attributes(G7_UW, LouvainCommunities, 'group')

# Interactive graph 
#UWGraphVisu7, _ = nw.visualize(G7_UW)

# Give communities an id
community_id = [LouvainCommunities[node] for node in G7_UW.nodes()]

# Visualise graph 
fig = plt.figure(figsize=(25,20))
nx.draw(G7_UW,
       edge_color = 'lightgrey',
       cmap = plt.cm.PuRd,
       node_color = community_id,
       node_size = 500,
       with_labels = True,
       edgecolors = 'black')

# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M7 = GraphModularity(G7_UW, partition)
print("Modularity of unweighted graph is ", str(M7))
In [ ]:
G7_W = WeightedGraph(characters_p, WC7, Nodes)

#WGraphVisu7, _ = nw.visualize(G7_W)

WG7.png

In [ ]:
# Modularity plots

# Create a dataset
h = [M, M1, M2, M3, M4, M5, M6, M7]
b = ('All', 'Book1', 'Book2', 'Book3', 'Book4', 'Book5', 'Book6', 'Book7')
x_pos = np.arange(len(b))

# Create bars with different colors
# Add color for entire book to our color scheme
special_colors = ['grey', 'red', 'green', 'magenta', 'blue', 'purple', 'cyan', 'orange'] 
plt.bar(x_pos, h, color=special_colors)

# Create names on the x-axis
plt.xticks(x_pos, b)

# Create title name
plt.title("Modularity for the entire book series and each book individually")

# Show graph
plt.show()

People definitely interact with each other across communities which this plot supports as the modularity appears relatively neutral. In Books 5-7 the battle between Good (Harry) and Evil (Voldemort) becomes more clear so does the tendency to interact more closely with one's community.

WordClouds¶

WordClouds are used to help us identify the most unique and important topics in the book series and each individual book.

Process and method:

  1. Make the tokens of the plot summaries lower case
  2. Create a WordCloud for the entire book series based only on term-frequency.
  3. Create a WordCloud per book based on term frequency - inverse document frequency (TF-IDF)
In [ ]:
# Make tokens lower case
lower_tokens = []
for index, row in df.iterrows():
    lower_tokens.append([token.lower() for token in row['plot_tokens']])
df['tokens_lower'] = lower_tokens
df

Full series wordcloud¶

In [ ]:
# Create WordCloud for entire book
full_text = " "
for index, row in df.iterrows():
    full_text += row['summary'].lower() + " "

wordcloud = WordCloud(width = 800, height = 800,
                    background_color ='white',
                    min_font_size = 10).generate(full_text)

# Plot the WordCloud image                      
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()

Individual book wordclouds with TF-IDF¶

In [ ]:
# Make dictionary with tokens as list
docs = {i: {'tokens':[]} for i in range(1,8)}
for index, row in df.iterrows():
    docs[row['book']]['tokens'].extend(row['tokens_lower'])

# Make each list into a nltk Text
for book in docs.keys():
    docs[book]['text'] = nltk.Text(docs[book]['tokens']) 

# For each book
for book in docs.keys():
    # Create document from text
    doc = docs[book]['text']
    # Frequency distribution 
    fdist = nltk.FreqDist(doc)
    # Find unique word in text
    docs[book]['unique_words'] = [tup for tup in fdist]
    # Find term frequency
    docs[book]['TF'] = [fdist[word]/len(doc) for word in docs[book]['unique_words']]
    # Connect term frequency with the word in dictionary
    docs[book]['TF_word'] = Counter({word: docs[book]['TF'][i] for i, word in enumerate(docs[book]['unique_words'])})
    
# How many of the documents are each word in?
cumulative_unique = [] 
for book in docs.keys():
    cumulative_unique.extend(docs[book]['unique_words'])
# Make into nltk Text
cumulative_unique = nltk.Text(cumulative_unique)
# Frequency distribution 
fdist_allwords = nltk.FreqDist(cumulative_unique)

N = 7 #Number of books
# For each book
for book in docs.keys():
    # Inverse document frequency (IDF)
    docs[book]['IDF'] = [np.log10(N/fdist_allwords[word]) for word in docs[book]['unique_words']]
    # Term frequency-inverse document frequency (TF-IDF)
    docs[book]['TF-IDF'] = [docs[book]['TF'][i]*docs[book]['IDF'][i] for i in range(len(docs[book]['unique_words']))]
    # Connect term frequency-inverse document frequency with the word in a dictionary
    docs[book]['TF-IDF_word'] = Counter({word: docs[book]['TF-IDF'][i] for i, word in enumerate(docs[book]['unique_words'])})
In [ ]:
# Create a WordCloud per book based on term frequency-inverse document frequency (TF-IDF)
for book in docs.keys():
    wordcloud = WordCloud(width = 800, height = 800,
                    background_color ='white',
                    min_font_size = 10).generate_from_frequencies(docs[book]['TF-IDF_word'])

    # Plot the WordCloud                       
    plt.figure(figsize = (8, 8), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad = 0)
    plt.show()

Sentiment analysis¶

Sentiment analysis will help us test our theory on whether the books become more and more dark, sinister, and gloomy - both in relation to book depth and the book series as a whole. We will use the labMT happiness scores from https://hedonometer.org/words/labMT-en-v2/ (downloaded as a .csv file) to compute a happiness score for each chapter.

Process and method:

  1. Get hedonometer happiness scores dataframe
  2. Create function that takes a list of tokens and returns the happiness score of the text (list). Words that do not exist in the labMT dataset can be overlooked.
  3. Remove all stopwords as these do not say much about the content of the text.
  4. Compute happiness scores and their standard deviation for each chapter.
  5. Vizualize the results (these are commented individually futher down).
  6. Create word shift visualisation for select chapters
In [ ]:
# Get hedonometer happiness scores
df_hedonometer = pd.read_csv('Hedonometer.csv', index_col=0) 
# Alter score to be from -5 to 5 instead of 0 to 10 
altered_happiness = [score-5 for score in df_hedonometer['Happiness Score']]
df_hedonometer['Happiness Score'] = altered_happiness
df_hedonometer

The happiness score is adjusted to be from -5 (most unhappy) to 5 (most happy) with 0 being the neutral middle.

In [ ]:
# Make a happiness dictionary from dataframe to use for lookup
happiness_score_dict = {row.Word: row['Happiness Score'] for index, row in df_hedonometer.iterrows()}

def GetHappinessScore(tokens):
    """
    Takes a list of token
    Returns the mean and standard deviation of the happiness score of the text

    """
    # Get list of scores for all the words from the list that exist in the labmt dataset
    happiness_score = [happiness_score_dict[token] for token in tokens if token in happiness_score_dict.keys()]
    # return mean if there are any words with a score, else return nan
    return np.mean(happiness_score) if bool(happiness_score) else np.nan , np.std(happiness_score) if bool(happiness_score) else np.nan
In [ ]:
#Get happiness score for all chapters
remove_stopwords = True # Remove stopwords from token list
# List for overall score each chapter 
chapter_happiness = []
# List for standard deviation of happiness score each chapter 
chapter_happiness_std = []
# List for lower case tokens
all_tokens_lower = []
for index, row in df.iterrows():
    # Lower case tokens and remove stopwords
    chapter_tokens = [token.lower() for token in row['plot_tokens'] if (remove_stopwords and token not in stopwords.words('english'))]
    all_tokens_lower.append(chapter_tokens)
    # Compute chapter happiness scores and standard deviation
    score, std = GetHappinessScore(chapter_tokens)
    # Append happiness score
    chapter_happiness.append(score)
    # Append its standard deviation
    chapter_happiness_std.append(std)
# Add to dataframe
df['happiness'] = chapter_happiness
df['happiness_std'] = chapter_happiness_std
df['tokens_cleaned'] = all_tokens_lower
In [ ]:
df

Plot series happiness¶

We plot the chapters as time on the x-axis and the mean happiness score along the y-axis. This show the development as the series progresses. The trendline is a linear fit. Interesting peaks and valleys are highlightet with a small description of the chapter.

The plot is created with and without standard deviation for comparison.

In [ ]:
# Setup for plots 
def setup_mpl():
    mpl.rcParams['font.size']=5
    mpl.rcParams['figure.figsize']=(4.5,2.5)
    mpl.rcParams['figure.dpi']=200    
setup_mpl()
In [ ]:
def plotSeriesHappiness(std, df):
    # Ongoing chapter number
    ch = [i for i in range(1,len(df)+1)]
    df['chapter_in_series'] = ch

    # Make trendline
    z = np.polyfit(df['chapter_in_series'], df['happiness'], 1)
    p = np.poly1d(z)

    # Plot
    fig, ax = plt.subplots()
    for i in range(1,8):
        dfbook = df[df['book']==i]
        ax.plot(dfbook['chapter_in_series'], dfbook['happiness'], color=colors[i-1], label=f"book {i}")
        if std:
            ax.fill_between(dfbook['chapter_in_series'], dfbook['happiness']+dfbook['happiness_std'], dfbook['happiness']-dfbook['happiness_std'], facecolor=colors[i-1], alpha=0.2)
    ax.plot(df['chapter_in_series'],p(df['chapter_in_series']),"k--", label="Trend line")
    # Chapters that could be of interest to look at 
    ax.text(42,df['happiness'][41], "Boggart class", size='small')
    ax.plot(42,df['happiness'][41],'r.')
    ax.text(64,df['happiness'][63], "Quidditch world cup", size='small')
    ax.plot(64,df['happiness'][63],'r.')
    ax.text(89,df['happiness'][88], "Cedric killed", size='small')
    ax.plot(89,df['happiness'][88],'r.')
    ax.text(116,df['happiness'][115], "Arthur in hospital after snake attack", size='small')
    ax.plot(116,df['happiness'][115],'r.')
    ax.text(119,df['happiness'][118], "People start believing Harry", size='small')
    ax.plot(119,df['happiness'][118],'r.')
    ax.text(129,df['happiness'][128], "Sirius killed by Bellatrix", size='small')
    ax.plot(129,df['happiness'][128],'r.')
    ax.text(148,df['happiness'][147], "Christmas at the burrow", size='small') # fake positive - read plot
    ax.plot(148,df['happiness'][147],'r.')
    ax.text(169,df['happiness'][168], "Harry's 17th birthday, kisses Ginny", size='small') 
    ax.plot(169,df['happiness'][168],'r.')
    ax.text(196,df['happiness'][195], "Harry goes to die", size='small') 
    ax.plot(196,df['happiness'][195],'r.')
    ax.text(161,df['happiness'][160], "Hospital wing after battle with Death Eaters", size='small') 
    ax.plot(161,df['happiness'][160],'r.')
    ax.legend(loc="lower left", ncol=3)
    ax.set_xlabel("Chapter in series")
    ax.set_ylabel("Mean happiness score")
    if std:
        ax.set_title("Series happiness with Standard deviation")
    else:
        ax.set_title("Series happiness")
    plt.show()
In [ ]:
plotSeriesHappiness(False, df)

It looks like most chapters score a little happier than neutral (score of 0) and that our hypothesis of the series getting less happy as time goes on is supported. The highlighted peaks and valleys will be explored further below.

What happens when we look at the standard deviation of the happiness score?

In [ ]:
plotSeriesHappiness(True, df)
In [ ]:
# How many tokens do we have in each chapter
print(f"Shortest chapter summary has {min([len(ch) for ch in df['tokens_cleaned']])} tokens. Longest chapter summary has {max([len(ch) for ch in df['tokens_cleaned']])} tokens")

The labMT dataset was created for and originally used on twitter-data. That is, on a much bigger dataset. Our cleaned chapter plot summaries are very short texts of 30-621 words. This is what causes the very large standard deviation on the mean happiness scores. Few individual words can have an outsized role on the score of a chapter. Looking at this plot, there isn't as much to say about the peaks, valleys or trends. Almost every chapter has a happiness score of 0.5 plus/minus 1.

Scores on the timescales of one book¶

We want to see if the within-book happiness trend is similar to the full-series trend. These scores are of course still subject to the standard deviations shown above.

In [ ]:
# Each book individually
fig, ax = plt.subplots()
for i in range(1,8):
    dfbook = df[df['book']==i]
    ax.plot(dfbook['chapter_number'], dfbook['happiness'], color=colors[i-1], label=f"book {i}")
ax.legend(loc="upper right", ncol=2)
ax.set_xlabel("Chapter in book")
ax.set_ylabel("Mean happiness score")
ax.set_title("Chapter happiness")
plt.show()

This is rather messy, so we look at the trendlines.

In [ ]:
fig, ax = plt.subplots()
for i in range(1,8):
    dfbook = df[df['book']==i]
    # normalise from number of chapters to [0,1]
    x = dfbook['chapter_number']
    # make trendline
    z = np.polyfit(x, dfbook['happiness'], 1)
    p = np.poly1d(z)
    ax.plot(x, p(x), color=colors[i-1], label=f"book {i}")
ax.legend(loc="lower left", ncol=3)
ax.set_xlabel("Chapter")
ax.set_ylabel("Mean happiness score")
ax.set_title("Book Happiness - Trendlines")
plt.show()

The trendlines show us that all the books have a negative happiness trend. This supports our hypothesis, but only vaguely as the actual difference in happiness score from beginning to end is very small. The books however have different chapter lengths, so below we normalise for a better comparison.

In [ ]:
fig, ax = plt.subplots()
for i in range(1,8):
    dfbook = df[df['book']==i]
    # normalise from number of chapters to [0,1]
    x = np.arange(0,1,1/len(dfbook))
    # make trendline
    z = np.polyfit(x, dfbook['happiness'], 1)
    p = np.poly1d(z)
    ax.plot(x, p(x), color=colors[i-1], label=f"book {i}")
ax.legend(loc="lower left", ncol=3)
ax.set_xlabel("Depth in book")
ax.set_ylabel("Mean happiness score")
ax.set_title("Book Happiness - Trendlines")
plt.show()

Now the books are comparable and we see that book 5 appears to have the steepest decline and book 7 starts out the least happy. We know from above that the mean-scores can seem to tell us more than they actually do, so we add the standard deviations back in as above.

In [ ]:
fig, ax = plt.subplots()
for i in range(1,8):
    dfbook = df[df['book']==i]
    x = np.arange(0,1,1/len(dfbook))
    # make trendline
    z = np.polyfit(x, dfbook['happiness'], 1)
    p = np.poly1d(z)
    ax.plot(x, p(x), color=colors[i-1], label=f"book {i}")
    ax.fill_between(x, dfbook['happiness']+dfbook['happiness_std'], dfbook['happiness']-dfbook['happiness_std'], facecolor=colors[i-1], alpha=0.2)
ax.legend(loc="lower left", ncol=3)
ax.set_xlabel("Depth in book")
ax.set_ylabel("Mean happiness score")
ax.set_title("Book Happiness - Trendlines with Standard Deviation")
plt.show()

This final plot tells us once again that the small quantity of words in each chapter makes for a large standard deviation. And really, with this data there maybe isn't all that much we can conclude based on the happiness score sentiment analysis.

Word shifts for select chapters¶

In [ ]:
# a function that computes and plots the word shifts
def createWordShift(df, d):
    # d is the chapters full-series number minus 1 (since the dataframe is zero-indexed)
    # get the clean tokens for the chapter of interest
    l = df['tokens_cleaned'][d] 
    # get clean tokes for the 3 chapters preceeding chapter d
    l_ref = []
    for i in range(3):
        l_ref.extend(df['tokens_cleaned'][d-3:d].values[i]) 
    # compute relative frequency for each token in list l
    p = dict([(item[0], item[1]/len(l)) for item in Counter(l).items()])
    # compute relative frequency for each token in list l_ref
    p_ref = dict([(item[0], item[1]/len(l)) for item in Counter(l_ref).items()])
    # get set of tokens
    all_tokens = set(p.keys()).union(set(p_ref.keys()))
    # compute difference between p_ref and p
    delta_p = dict([(token, p.get(token,0) - p_ref.get(token,0)) for token in all_tokens])
    # compute happiness score for each token
    h = dict([(token, happiness_score_dict.get(token, np.nan)) for token in all_tokens])
    # compute happiness times difference in relative frequency for each token
    d_phi = [(token, h[token]*delta_p[token]) for token in all_tokens if not np.isnan(h[token])]
    # plot wordshifts using shifterator
    sentiment_sh = sh.WeightedAvgShift(type2freq_1=p_ref,
                   type2freq_2=p,
                   type2score_1=happiness_score_dict,
                   reference_value=0)
    sentiment_sh.get_shift_graph(detailed=True,
                            system_names = ['Previous 3 chapters',f'Book {df.book[d]}, chapter {df.chapter_number[d]}'])

The word shifts for these select chapters show what words are causing them to stand out as particularly positive or negative compared to the previous 3 chapters. We have picked 4 chapters that stand out as particularly positive or negative in the series-happiness plot above to look at a little closer

In [ ]:
chapters = [88, 128, 147, 168]
chapter_string = ["Cedric killed", "Sirius killed by Bellatrix",  "Christmas at the burrow", "Harry's 17th birthday, kisses Ginny"]

for idx, chapter in enumerate(chapters):
    print(chapter_string[idx])
    createWordShift(df, chapter)

The two first chapters have rather negative shifts. This is expected as a good character is killed by evil forces. Words like "kill", "grave", horror", "dead", "defeated" and "battle" olay a large role in these negative shifts.

the two last chapters have positive shifts. These are about a Christmas and a birthday. Words like "Christmas", "holidays", "birthday", "wedding", and "kisses" pull these chapters in the positive direction.

The labMT dataset is not made for Harry Potter, so some words that take on a specific meaning in the wizarding world may have more colloquial meanings in general English. An example of this is "lord" which in Harry Potter always refers to Lord Voldemort, and certainly could be categorised as unhappy, is shown in the first highlighted chapter to be a very positive word, probably because people have religious associations with it. Another word where the reverse is true is "snitch" - rather negative in labMT happiness score, where it in Harry Potter refers to a ball in quidditch and thus not negative at all.

Another thing worth noting is that much like the happiness plots above, these word shifts are affected by the very small amount of words in each chapter. Individual words can have an outsized effect.

The method also has its limitations. It cannot understand context and sometimes seemingly positive words can be put together to form obviously negative sentences. The Christmas chapter we have picked out here is an example of this. Some of the positive words like "boost", "innocent" and "ministry" are actually not that positive when you get them in context - see the summary below.

In [ ]:
df.summary[147]

Discussion¶

What went well?¶

We chose a topic we are all very interested in which made it super fun to work with. It made it easy to see patterns and made us able to make an extensive analysis due to our domain knowledge.

What could be improved?¶

We found it difficult to work with the website and insert dynamic visualizations, interactive analysis, etc. as this was not introduced in class.

The drawback of community partion methods is that they have a tendency to let small clusters be absorbed by larger ones. This would make our community analysis less nuanced.

We worked with summaries and not the full texts. A rule of thumb is that more data lead to more information. We noticed that only 183 of of 403 characters are mentioned in the summaries which influence the networks. Yet, one have to assume that if they are not mentioned in the summaries, they are not important characters and thus irrelevant. Using book chapters instead of summaries would provide more words which could have resultet in a more informative sentiment analysis. A work of fiction might also use more expressive words than a summary, which again might have affected the sentiment analysis.

The sentiment analysis could have benefitted from a better tokenisation/stemming process which would have made more words available for happiness score calculations. This might have improved the uncertainty a bit and reduced the large standard deviations.

References¶

  • Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with python. O’Reilly Media.
  • Beautiful Soup Documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
  • Netwulf Documentation: https://netwulf.readthedocs.io/en/latest/#
  • NetworkX Documentation: https://networkx.org/documentation/networkx-1.9/#

Data from

  • Harry Potter Fandom wiki https://harrypotter.fandom.com/wiki/Main_Page
  • Harry Potter characters http://hp-api.herokuapp.com/
  • Happiness Scores from Hedonometer https://hedonometer.org/words/labMT-en-v2/